{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/NeuromatchAcademy/course-content/blob/master/tutorials/W1D3_ModelFitting/W1D3_Tutorial2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "# Neuromatch Academy: Week 1, Day 3, Tutorial 2\n",
    "# Model Fitting: Linear regression with MLE\n",
    "\n",
    "**Content creators**: Pierre-Étienne Fiquet, Anqi Wu, Alex Hyafil  with help from Byron Galbraith\n",
    "\n",
    "**Content reviewers**: Lina Teichmann, Madineh Sarvestani, Patrick Mineault, Ella Batty, Michael Waskom\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "#Tutorial Objectives\n",
    "\n",
    "This is Tutorial 2 of a series on fitting models to data. We start with simple linear regression, using least squares optimization (Tutorial 1) and Maximum Likelihood Estimation (Tutorial 2). We will use bootstrapping to build confidence intervals around the inferred linear model parameters (Tutorial 3). We'll finish our exploration of regression models by generalizing to multiple linear regression and polynomial regression (Tutorial 4). We end by learning how to choose between these various models. We discuss the bias-variance trade-off (Tutorial 5) and Cross Validation for model selection (Tutorial 6).\n",
    "\n",
    "In this tutorial, we will use a different approach to fit linear models that incorporates the random 'noise' in our data.\n",
    "- Learn about probability distributions and probabilistic models\n",
    "- Learn how to calculate the likelihood of our model parameters\n",
    "- Learn how to implement the maximum likelihood estimator, to find the model parameter with the maximum likelihood\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from scipy import stats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {},
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "#@title Figure Settings\n",
    "import ipywidgets as widgets       # interactive display\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "plt.style.use(\"https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/nma.mplstyle\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {},
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "#@title Helper Functions\n",
    "def plot_density_image(x, y, theta, sigma=1, ax=None):\n",
    "  \"\"\" Plots probability distribution of y given x, theta, and sigma\n",
    "\n",
    "  Args:\n",
    "\n",
    "    x (ndarray): An array of shape (samples,) that contains the input values.\n",
    "    y (ndarray): An array of shape (samples,) that contains the corresponding\n",
    "      measurement values to the inputs.\n",
    "    theta (float): Slope parameter\n",
    "    sigma (float): standard deviation of Gaussian noise\n",
    "\n",
    "  \"\"\"\n",
    "\n",
    "  # plot the probability density of p(y|x,theta)\n",
    "  if ax is None:\n",
    "    fig, ax = plt.subplots()\n",
    "\n",
    "  xmin, xmax = np.floor(np.min(x)), np.ceil(np.max(x))\n",
    "  ymin, ymax = np.floor(np.min(y)), np.ceil(np.max(y))\n",
    "  xx = np.linspace(xmin, xmax, 50)\n",
    "  yy = np.linspace(ymin, ymax, 50)\n",
    "\n",
    "  surface = np.zeros((len(yy), len(xx)))\n",
    "  for i, x_i in enumerate(xx):\n",
    "    surface[:, i] = stats.norm(theta * x_i, sigma).pdf(yy)\n",
    "\n",
    "  ax.set(xlabel='x', ylabel='y')\n",
    "\n",
    "  return ax.imshow(surface, origin='lower', aspect='auto', vmin=0, vmax=None,\n",
    "            cmap=plt.get_cmap('Wistia'),\n",
    "            extent=[xmin, xmax, ymin, ymax])\n",
    "\n",
    "\n",
    "\n",
    "def solve_normal_eqn(x, y):\n",
    "  \"\"\"Solve the normal equations to produce the value of theta_hat that minimizes\n",
    "    MSE.\n",
    "\n",
    "    Args:\n",
    "    x (ndarray): An array of shape (samples,) that contains the input values.\n",
    "    y (ndarray): An array of shape (samples,) that contains the corresponding\n",
    "      measurement values to the inputs.\n",
    "    theta_hat (float): An estimate of the slope parameter.\n",
    "\n",
    "  Returns:\n",
    "    float: The mean squared error of the data with the estimated parameter.\n",
    "  \"\"\"\n",
    "  theta_hat = (x.T @ y) / (x.T @ x)\n",
    "  return theta_hat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Section 1: Maximum Likelihood Estimation (MLE)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 517
    },
    "colab_type": "code",
    "outputId": "68f35348-6538-4524-dda4-e150e08f5588"
   },
   "outputs": [],
   "source": [
    "#@title Video 1: Maximum Likelihood Estimation\n",
    "from IPython.display import YouTubeVideo\n",
    "video = YouTubeVideo(id=\"8mpNmzLKNfU\", width=854, height=480, fs=1)\n",
    "print(\"Video available at https://youtube.com/watch?v=\" + video.id)\n",
    "video"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "In the previous tutorial we made the assumption that the data was drawn from a linear relationship with noise added, and found an effective approach for estimating model parameters based on minimizing the mean squared error.\n",
    "\n",
    "In that case we treated the noise as simply a nuisance, but what if we factored it directly into our model?\n",
    "\n",
    "Recall our linear model:\n",
    "\n",
    "\\begin{align}\n",
    "y = \\theta x + \\epsilon.\n",
    "\\end{align}\n",
    "\n",
    "The noise component $\\epsilon$ is often modeled as a random variable drawn from a Gaussian distribution (also called the normal distribution).\n",
    "\n",
    "The Gaussian distribution is described by its [probability density function](https://en.wikipedia.org/wiki/Probability_density_function) (pdf)\n",
    "\\begin{align}\n",
    "\\mathcal{N}(x; \\mu, \\sigma^2) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}}e^{-\\frac{1}{2\\sigma^2}(x-\\mu)^2}\n",
    "\\end{align}\n",
    "\n",
    "and is dependent on two parameters: the mean $\\mu$ and the variance $\\sigma^2$. We often consider the noise signal to be Gaussian \"white noise\", with zero mean and unit variance:\n",
    "\n",
    "\\begin{align}\n",
    "\\epsilon \\sim \\mathcal{N}(0, 1).\n",
    "\\end{align}\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Interactive Demo: Gaussian Distribution Explorer\n",
    "\n",
    "Use the explorer widget below to see how varying the $\\mu$ and $\\sigma$ parameters change the location and shape of the samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 494,
     "referenced_widgets": [
      "bfd70882f10945b096fe70d9ef1486cd",
      "2d7a17359c5d430abbc21d3e20d2e8a5",
      "4fac186a03454c3f95c25da40f71e19c",
      "5df5db2be4f049819d26d13ead27e41c",
      "d1d82ff4a0274afeb6afbd311ba2f526",
      "d0e0cb67d67c4550a0bcbabdf6c45455",
      "d99af8c857254508a1d4359f445f4f51",
      "470d2916aeb4493aa17cf5b467ebc90f",
      "4ca59a8b511b400d837f46949611c867",
      "791de5eb8c024e88bad22199a94c9be3"
     ]
    },
    "colab_type": "code",
    "outputId": "c6e875c3-a219-40f0-fdb4-51df8af1fa0c"
   },
   "outputs": [],
   "source": [
    "#@title\n",
    "\n",
    "#@markdown Make sure you execute this cell to enable the widget!\n",
    "\n",
    "@widgets.interact(mu=widgets.FloatSlider(0.0, min=-2.0, max=2.0),\n",
    "                  sigma=widgets.FloatSlider(1.0, min=0.5, max=2.0))\n",
    "def plot_normal_dist(mu=0, sigma=1):\n",
    "\n",
    "  # Generate pdf & samples from normal distribution with mu/sigma\n",
    "  rv = stats.norm(mu, sigma)\n",
    "  x = np.linspace(-5, 5, 100)\n",
    "  y = rv.pdf(x)\n",
    "  samples = rv.rvs(1000)\n",
    "\n",
    "  # Plot\n",
    "  fig, ax = plt.subplots()\n",
    "  ax.hist(samples, 20, density=True, color='g', histtype='stepfilled', alpha=0.8,\n",
    "          label='histogram')\n",
    "  ax.plot(x, y, color='orange', linewidth=3, label='pdf')\n",
    "  ax.vlines(mu, 0, rv.pdf(mu), color='y', linewidth=3, label='$\\mu$')\n",
    "  ax.vlines([mu-sigma, mu+sigma], 0, rv.pdf([mu-sigma, mu+sigma]), colors='red',\n",
    "            color='b', linewidth=3, label='$\\sigma$')\n",
    "  ax.set(xlabel='x', ylabel='probability density', xlim=[-5, 5], ylim=[0, 1.0])\n",
    "  ax.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "\n",
    "## Section 1.1: Probabilistic Models\n",
    "\n",
    "Now that we have a model of our noise component $\\epsilon$ as random variable, how do we incorporate this back into our original linear model from before? Consider again our simplified model $y = \\theta x + \\epsilon$ where the noise has zero mean and unit variance $\\epsilon \\sim \\mathcal{N}(0, 1)$. We can now also treat $y$ as a random variable drawn from a Gaussian distribution where $\\mu = \\theta x$ and $\\sigma^2 = 1$:\n",
    "\n",
    "\\begin{align}\n",
    "y \\sim \\mathcal{N}(\\theta x, 1)\n",
    "\\end{align}\n",
    "\n",
    "which is to say that the probability of observing $y$ given $x$ and parameter $\\theta$ is\n",
    "\\begin{align}\n",
    "p(y|x,\\theta) = \\frac{1}{\\sqrt{2\\pi}}e^{-\\frac{1}{2}(y-\\theta x)^2}\n",
    "\\end{align}\n",
    "\n",
    "---\n",
    "\n",
    "Let's revisit our original sample dataset where the true underlying model has $\\theta = 1.2$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {},
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# @title\n",
    "\n",
    "# @markdown Execute this cell to generate some simulated data\n",
    "\n",
    "# setting a fixed seed to our random number generator ensures we will always\n",
    "# get the same psuedorandom number sequence\n",
    "\n",
    "np.random.seed(121)\n",
    "theta = 1.2\n",
    "n_samples = 30\n",
    "x = 10 * np.random.rand(n_samples) # sample from a uniform distribution over [0,10)\n",
    "noise = np.random.randn(n_samples) # sample from a standard normal distribution\n",
    "y = theta * x + noise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "This time we can plot the density of $p(y|x,\\theta=1.2)$ and see how $p(y)$ changes for different values of $x$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 285
    },
    "colab_type": "code",
    "outputId": "edaf91a5-9223-4c93-8d53-63bd34761c32"
   },
   "outputs": [],
   "source": [
    "#@title\n",
    "#@markdown Execute this cell to visualize p(y|x, theta=1.2)\n",
    "\n",
    "fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4))\n",
    "\n",
    "# Invokes helper function to generate density image plots from data and parameters\n",
    "im = plot_density_image(x, y, 1.2, ax=ax1)\n",
    "plt.colorbar(im, ax=ax1)\n",
    "ax1.axvline(8, color='k')\n",
    "ax1.set(title=r'p(y | x, $\\theta$=1.2)')\n",
    "\n",
    "# Plot pdf for given x\n",
    "ylim = ax1.get_ylim()\n",
    "yy = np.linspace(ylim[0], ylim[1], 50)\n",
    "ax2.plot(yy, stats.norm(theta * 8, 1).pdf(yy), color='orange', linewidth=2)\n",
    "ax2.set(\n",
    "    title=r'p(y|x=8, $\\theta$=1.2)',\n",
    "    xlabel='y',\n",
    "    ylabel='probability density');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Section 1.2: Likelihood Estimation\n",
    "\n",
    "Now that we have our probabilistic model, we turn back to our original challenge of finding a good estimate for $\\theta$ that fits our data. Given the inherent uncertainty when dealing in probabilities, we talk about the [likelihood](https://en.wikipedia.org/wiki/Likelihood_function) that some estimate $\\hat \\theta$ fits our data. The likelihood function $\\mathcal{L(\\theta)}$ is equal to the probability density function parameterized by that $\\theta$:\n",
    "\n",
    "\\begin{align}\n",
    "\\mathcal{L}(\\theta|x,y) = p(y|x,\\theta) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}}e^{-\\frac{1}{2\\sigma^2}(y-\\theta x)^2}\n",
    "\\end{align}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "### Exercise 1: Likelihood Function\n",
    "\n",
    "In this exercise you will implement the likelihood function $\\mathcal{L}(\\theta|x,y)$ for our linear model where $\\sigma = 1$.\n",
    "\n",
    "After implementing this function, we can produce probabilities that our estimate $\\hat{\\theta}$ generated the provided observations. We will try with one of the samples from our dataset.\n",
    "\n",
    "TIP: Use `np.exp` and `np.sqrt` for the exponential and square root functions, respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "def likelihood(theta_hat, x, y):\n",
    "  \"\"\"The likelihood function for a linear model with noise sampled from a\n",
    "    Gaussian distribution with zero mean and unit variance.\n",
    "\n",
    "  Args:\n",
    "    theta_hat (float): An estimate of the slope parameter.\n",
    "    x (ndarray): An array of shape (samples,) that contains the input values.\n",
    "    y (ndarray): An array of shape (samples,) that contains the corresponding\n",
    "      measurement values to the inputs.\n",
    "\n",
    "  Returns:\n",
    "    ndarray: the likelihood values for the theta_hat estimate\n",
    "  \"\"\"\n",
    "  sigma = 1\n",
    "  ##############################################################################\n",
    "  ## TODO for students: implement the likelihood function\n",
    "  # Fill out function and remove\n",
    "  raise NotImplementedError(\"Student exercise: implement the likelihood function\")\n",
    "  ##############################################################################\n",
    "\n",
    "  # Compute Gaussian likelihood\n",
    "  pdf = ...\n",
    "\n",
    "  return pdf\n",
    "\n",
    "\n",
    "# Uncomment below to test your function\n",
    "# print(likelihood(1.0, x[1], y[1]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "outputId": "6f1e6ea1-7a95-4b7f-902d-b7ee858b0d76"
   },
   "outputs": [],
   "source": [
    "# to_remove solution\n",
    "def likelihood(theta_hat, x, y):\n",
    "  \"\"\"The likelihood function for a linear model with noise sampled from a\n",
    "    Gaussian distribution with zero mean and unit variance.\n",
    "\n",
    "  Args:\n",
    "    theta_hat (float): An estimate of the slope parameter.\n",
    "    x (ndarray): An array of shape (samples,) that contains the input values.\n",
    "    y (ndarray): An array of shape (samples,) that contains the corresponding\n",
    "      measurement values to the inputs.\n",
    "\n",
    "  Returns:\n",
    "    float: the likelihood value for the theta_hat estimate\n",
    "  \"\"\"\n",
    "  sigma = 1\n",
    "\n",
    "  # Compute Gaussian likelihood\n",
    "  pdf = 1 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(y - theta_hat * x)**2 / (2 * sigma**2))\n",
    "\n",
    "  return pdf\n",
    "\n",
    "\n",
    "print(likelihood(1.0, x[1], y[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "We should see that $\\mathcal{L}(\\theta=1.0|x=2.1,y=3.7) \\approx 0.11$. So far so good, but how does this tell us how this estimate is better than any others?\n",
    "\n",
    "When dealing with a set of data points, as we are with our dataset, we are concerned with their joint probability -- the likelihood that all data points are explained by our parameterization. Since we have assumed that the noise affects each output independently, we can factorize the likelihood, and write:\n",
    "\n",
    "\\begin{align}\n",
    "\\mathcal{L}(\\theta|X,Y) = \\prod_{i=1}^N \\mathcal{L}(\\theta|x_i,y_i),\n",
    "\\end{align}\n",
    "\n",
    "where we have $N$ data points $X = \\{x_1,...,x_N\\}$ and $Y = \\{y_1,...,y_N\\}$.\n",
    "\n",
    "In practice, such a product can be numerically unstable. Indeed multiplying small values together can lead to [underflow](https://en.wikipedia.org/wiki/Arithmetic_underflow), the situation in which the digital representation of floating point number reaches its limit. This problem can be circumvented by taking the logarithm of the likelihood because the logarithm transforms products into sums:\n",
    "\n",
    "\\begin{align}\n",
    "\\operatorname{log}\\mathcal{L}(\\theta|X,Y) = \\sum_{i=1}^N \\operatorname{log}\\mathcal{L}(\\theta|x_i,y_i)\n",
    "\\end{align}\n",
    "\n",
    "We can take the sum of the log of the output of our `likelihood` method applied to the full dataset to get a better idea of how different $\\hat\\theta$ compare. We can also plot the different distribution densities over our dataset and see how they line up qualitatively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 285
    },
    "colab_type": "code",
    "outputId": "2e6d6de4-c5a1-4371-a08b-a2df4eb27735"
   },
   "outputs": [],
   "source": [
    "#@title\n",
    "#@markdown Execute this cell to visualize different distribution densities\n",
    "theta_hats = [0.5, 1.0, 2.2]\n",
    "fig, axes = plt.subplots(ncols=3, figsize=(16, 4))\n",
    "for theta_hat, ax in zip(theta_hats, axes):\n",
    "  ll = np.sum(np.log(likelihood(theta_hat, x, y)))  # log likelihood\n",
    "  im = plot_density_image(x, y, theta_hat, ax=ax)\n",
    "  ax.scatter(x, y)\n",
    "  ax.set(title=fr'$\\hat{{\\theta}}$ = {theta_hat}, log likelihood: {ll:.2f}')\n",
    "plt.colorbar(im, ax=ax);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Using the log likelihood calculation, we see that $\\mathcal{L}(\\theta=1.0) > \\mathcal{L}(\\theta=0.5) > \\mathcal{L}(\\theta=2.2)$.\n",
    "\n",
    "This is great: now we have a way to compare estimators based on likelihood. But like with the MSE approach, we want an analytic solution to find the best estimator. In this case, we want to find the estimator that maximizes the likelihood.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Section 1.3: Finding the Maximum Likelihood Estimator\n",
    "\n",
    "We want to find the parameter value $\\hat\\theta$ that makes our data set most likely:\n",
    "\n",
    "\\begin{align}\n",
    "\\hat{\\theta}_{\\textrm{MLE}} = \\underset{\\theta}{\\operatorname{argmax}} \\mathcal{L}(\\theta|X,Y) \n",
    "\\end{align}\n",
    "\n",
    "We discussed how taking the logarithm of the likelihood helps with numerical stability, the good thing is that it does so without changing the parameter value that maximizes the likelihood. Indeed, the $\\textrm{log}()$ function is *monotonically increasing*, which means that it preserves the order of its inputs. So we have:\n",
    "\n",
    "\\begin{align}\n",
    "\\hat{\\theta}_{\\textrm{MLE}} = \\underset{\\theta}{\\operatorname{argmax}} \\sum_{i=1}^m \\textrm{log} \\mathcal{L}(\\theta|x_i,y_i) \n",
    "\\end{align}\n",
    "\n",
    "Now substituting our specific likelihood function and taking its logarithm, we get:\n",
    "\\begin{align}\n",
    "\\hat{\\theta}_{\\textrm{MLE}} = \\underset{\\theta}{\\operatorname{argmax}} [-\\frac{N}{2} \\operatorname{log} 2\\pi\\sigma^2 - \\frac{1}{2\\sigma^2}\\sum_{i=1}^N (y_i-\\theta x_i)^2].\n",
    "\\end{align}\n",
    "\n",
    "Note that maximizing the log likelihood is the same as minimizing the negative log likelihood (in practice optimization routines are developed to solve minimization not maximization problems). Because of the convexity of this objective function, we can take the derivative of our negative log likelihhood, set it to 0, and solve - just like our solution to minimizing MSE.\n",
    "\n",
    "\\begin{align}\n",
    "\\frac{\\partial\\operatorname{log}\\mathcal{L}(\\theta|x,y)}{\\partial\\theta}=\\frac{1}{\\sigma^2}\\sum_{i=1}^N(y_i-\\theta x_i)x_i = 0\n",
    "\\end{align}\n",
    "\n",
    "This looks remarkably like the equation we had to solve for the optimal MSE estimator, and, in fact, we arrive to the exact same solution!\n",
    "\n",
    "\\begin{align}\n",
    "\\hat{\\theta}_{\\textrm{MLE}} = \\hat{\\theta}_{\\textrm{MSE}} = \\frac{\\sum_{i=1}^N x_i y_i}{\\sum_{i=1}^N x_i^2}\n",
    "\\end{align}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# Compute theta_hat_MLE\n",
    "theta_hat_mle = (x @ y) / (x @ x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 430
    },
    "colab_type": "code",
    "outputId": "e2c78f3f-63e2-4f78-a355-7d20db18e9c5"
   },
   "outputs": [],
   "source": [
    "#@title\n",
    "#@markdown Execute this cell to visualize density with theta_hat_mle\n",
    "\n",
    "# Plot the resulting distribution density\n",
    "fig, ax = plt.subplots()\n",
    "ll = np.sum(np.log(likelihood(theta_hat_mle, x, y))) # log likelihood\n",
    "im = plot_density_image(x, y, theta_hat_mle, ax=ax)\n",
    "plt.colorbar(im, ax=ax);\n",
    "ax.scatter(x, y)\n",
    "ax.set(title=fr'$\\hat{{\\theta}}$ = {theta_hat_mle:.2f}, log likelihood: {ll:.2f}');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Summary\n",
    "\n",
    "- Likelihood vs probability\n",
    "    - $\\mathcal{L}(\\theta|x, y) = p(y|\\theta, x)$\n",
    "    - $p(y|\\theta, x)$ -> \"probability of observing the response $y$ given parameter $\\theta$ and input $x$\"\n",
    "    - $\\mathcal{L}(\\theta|x, y)$ -> \"likelihood model that parameters $\\theta$ produced response $y$ from input $x$\"\n",
    "- Log-likelihood maximization\n",
    "    - We take the $\\textrm{log}$ of the likelihood function for computational convenience\n",
    "    - The parameters $\\theta$ that maximize $\\textrm{log}\\mathcal{L}(\\theta|x, y)$ are the model parameters that maximize the probability of observing the data.\n",
    "- **Key point**:\n",
    "  - the log-likelihood is a flexible cost function, and is often used to find model parameters that best fit the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Appendix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "We can also see $\\mathrm{p}(\\mathrm{y} | \\mathrm{x}, \\theta)$ as a function of $x$. This is the stimulus likelihood function, and it is useful in case we want to decode the input $x$ from observed responses $y$. This is what is relevant from the point of view of a neuron that does not have access to the outside world and tries to infer what's out there from the responses of other neurons!\n",
    "\n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "W1D3_Tutorial2",
   "provenance": [],
   "toc_visible": true
  },
  "kernel": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
